# The Design and implementation of DCT/IDCT Chip with Novel Architecture

Kuo-Hsing Cheng\*, Chih-Sheng Huang# and Chun-Pin Lin

Department of Electrical Engineering, Tamkang University, Taipei Hsien, Taiwan, R.O.C. TEL:886-2-26215656 Ext.2731 FAX:886-2-26221565

E-mail: cheng@ee.tku.edu.tw\* cshuang@ee.tku.edu.tw# cplin@ee.tku.edu.tw

## ABSTRACT

In the paper, an efficient VLSI architecture for a 8x 8 twodimensional discrete cosine transform and inverse discrete cosine transform (2-D DCT/IDCT) with a new 1-D DCT/IDCT algorithm is presented. The proposed new algorithm makes all coefficients are positive to simplify the design of multipliers and the coefficients have less round-off error than Lee's algorithm [1]. For computing 2-D DCT/IDCT, the row-column decomposition method is used, and the design of 1-D DCT/IDCT requires only 9 multipliers and 21 adders/subtractors. This chip is synthesized with  $0.6\,\mu$  m standard cell library and 1P3M CMOS technology, and it can be operate up to 100MHz.

#### 1. INTRODUCTION

The DCT is widely used in video coding and image compression such as videoconference and HDTV [2][3]. The fast algorithms for computing 2-D DCT/IDCT can be divided into two classes: (1) The row-column decomposition methods [4][5]. These methods separate the 2-D DCT/IDCT into two 1-D DCT/IDCT with a transpose memory. These use 1-D fast DCT/IDCT algorithm to do the row processing and sent the results into a transpose memory to do the row column exchange, and then using 1-D fast DCT algorithm to do the column processing; (2) The not-row-column decomposition methods [6]. These methods direct use the 2-D DCT/IDCT algorithm to computing 2-D DCT/IDCT. These need less computing stages but cost much more hardware. Therefore, these are more suitable for software implementation than hardware implementation.

Our implementation of a 8x 8 2-D DCT/IDCT chip use the rowcolumn decomposition method. The most efficiency algorithms for computing 1-D DCT/IDCT are Lee's [1] and Hou's [7]. Hou's algorithm has less round-off error than Lee's algorithm [8][9], but some coefficients are positive and some are negative. Generally, if A and B are positive, the design of Ax B is easier than Ax (-B). Thus, we present a new algorithm that makes all the coefficients positive cosine forms to simplify the multiplier designs and have less round-off error than Lee's.

## 2. THE FAST DCT ALGORITHM

The normalized 1-D N-point DCT is defined as follow:

$$Z_{u} = \sqrt{\frac{2}{N}} C_{u} + \sum_{i=0}^{N-1} x_{i} + \cos \frac{(2i+1)u\pi}{2N}, \quad u = 0 \sim N-1$$
(1)

where

$$C_u = \frac{1}{\sqrt{2}}, \text{ for } u = 0$$
  
$$C_u = 1, \text{ for } u = 0 \sim N - 1$$

For simplicity, we neglect the scaling factor  $\sqrt{2}$ , c<sub>u</sub>, then Equation (1) can be written as

 $Z'_{u} = \sum_{i=0}^{N-1} x_{i} \cdot \cos \frac{(2i+1)u\pi}{2N}, \quad u = 0 \sim N - 1$ (2)In the following, it is assumed that the N is a power of 2. Let

u = 2u and u = 2u + 1 to separate Equation (2) into even and odd index forms, we have the even index form as

$$Z'_{2u} = \sum_{i=0}^{N-1} x_i \cdot \cos \frac{(2i+i)2u\pi}{2N}$$
  
=  $\sum_{i=0}^{N-1} x_i \cdot \cos \frac{(2i+i)2u\pi}{2(N/2)}$   
=  $\sum_{i=0}^{N/2-1} x_i \cdot \cos \frac{(2i+i)2u\pi}{2(N/2)} + \sum_{i=0}^{N/2-1} x_{N-1-i} \cdot \cos \frac{[2(N-1-i)+i]u\pi}{2(N/2)}$   
,  $u = 0 \sim N/2 - 1$  (3)

and odd index forms as

$$Z'_{2u+1} = \sum_{i=0}^{N-1} x_i \cdot \cos \frac{(2i+1)(2u+1)\pi}{2N}$$
(4)

$$Z'_{2u-1} = \sum_{i=0}^{N-1} x_i \cdot \cos \frac{(2i+1)(2u-1)\pi}{2N}$$
(5)

In Equation (4) and (5), when u=0, it implies  $Z'_1 = Z'_{-1}$ . If we add Equation (4) and (5) with the trigonometric formula

 $\cos(\alpha + \beta) + \cos(\alpha - \beta) = 2\cos\alpha\cos\beta$ (6)we can get

$$Z''_{2u+1} \equiv Z'_{2u+1} + Z'_{2u-1}$$

$$= \sum_{i=0}^{N-1} x_i \cdot 2 \cos \frac{(2i+1)\pi}{2N} \cdot \cos \frac{(2i+1)2u\pi}{2N}$$

$$= \sum_{i=0}^{N-1} x_i \cdot 2 \cos \frac{(2i+1)\pi}{2N} \cdot \cos \frac{(2i+1)u\pi}{2(\frac{N}{2})}$$

$$= \sum_{i=0}^{\frac{N}{2}-1} x_i \cdot 2 \cos \frac{(2i+1)\pi}{2N} \cdot \cos \frac{(2i+1)u\pi}{2(\frac{N}{2})}$$

$$+ \sum_{i=1}^{\frac{N}{2}-1} x_{N-1-i} \cdot 2 \cos \frac{[2(N-1-i)+1]\pi}{2N} \cdot \cos \frac{[2(N-1-i)+1]u\pi}{2(\frac{N}{2})}$$

$$, \quad u = 0 - \frac{N}{2} - 1$$
(7)

From

 $\cos \frac{[2(N-1-i)+1]u\pi}{2\binom{N}{2}} = \cos \left[ 2u\pi - \frac{(2i+1)u\pi}{2\binom{N}{2}} \right] = \cos \frac{(2i+1)u\pi}{2\binom{N}{2}}$ (8)

Equation (3) can be written as

0-7803-5482-6/99/\$10.00 @2000 IEEE

$$Z'_{2u} = \sum_{i=0}^{N_{2}'-1} \left[ x_{i} + x_{N-l-i} \right] \cos \frac{(2i+1)u\pi}{2(N_{2}')}$$
(9)

(10)

(12)

From

 $\cos\frac{[2(N-1-i)+1]\pi}{2N} = \cos\left[\pi - \frac{(2i+1)\pi}{2N}\right] = -\cos\frac{(2i+1)\pi}{2N}$ and Equation (8), Equation (7) becomes

$$Z''_{2u+1} = \sum_{i=0}^{N_{2}-1} \left[ x_{i} - x_{N-1-i} \right] \cdot 2 \cos \frac{(2i+1)\pi}{2N} \cdot \cos \frac{(2i+1)u\pi}{2(N_{2})}$$
(11)

$$g_i = x_i + x_{N-1-i}$$
,  $i = 1 \sim \frac{N}{2} - 1$ 

$$h_{i} = [x_{i} - x_{N-l-i}] \cdot 2\cos\frac{(2i+l)\pi}{2N} , \quad i = l \sim \frac{N}{2} - l$$
(13)

then Equation (9) and (11) becomes

$$G_{u} \equiv Z'_{2u} = \sum_{i=0}^{N_{2}^{\prime}-1} g_{i} \cdot \cos \frac{(2i+1)u\pi}{2(\frac{N_{2}}{2})}, \quad u = 1 \sim \frac{N_{2}^{\prime}-1}{(14)}$$

$$H_{u} \equiv Z''_{2u+1} = \sum_{i=0}^{N_{2}'-1} h_{i} \cdot \cos \frac{(2i+1)u\pi}{2(N_{2}')}, \quad u = 1 \sim N_{2}' - 1$$
(15)

Equation (14) and (15) are both (N/2)-point DCT, therefore, based on the formations derive from above, Equation (14) and (15) can recursively compute until N=2. The signal flow graph for a  $8 \times 8$ DCT/IDCT with the scaling factor is shown in Figure 1. Because the DCT is an orthogonal transform, the signal flow graph for the IDCT is just the inverse of the DCT.



Figure 1. The DCT/IDCT signal flow graph

## 3. THE SIMPLIFIED OF SIGNAL FLOW GRAPH

In Figure 1, the DCT/IDCT signal flow graph requires 13 multipliers and 29 adders/subtractors. In the following, we will simplify the DCT/IDCT signal flow graph. The simplified flow is as follow:

- To multiply each row of the signal flow graph in Figure 1 with  $(2\cos(1/(4 \pi)))^{-1}$ , it can save one multiplier, which is shown in Figure 2.
- As shown in Figure 2, both the parts enclosed by dash line and real line are the same. If we take away one of them and pipeline the signal flow graph appropriately, the hardware can save eight adders/subtractors and three multipliers. Finally, the hardware comparison is shown in Table 1.



Figure 2. The simplified of DCT/IDCT signal flow graph

|                    | before simplify | after simpilfy |  |
|--------------------|-----------------|----------------|--|
| multipliers        | 13              | 8              |  |
| adders/subtractors | 29              | 21             |  |

#### Table 1. The hardware comparison

In order to pipeline the signal flow graph appropriately, Figure 3 (Figure 4) shows the timing flow of DCT (IDCT), where the number of () means the timing flow along path 1 and the number of [] means the timing flow along path 2.







Figure 4. The timing flow of IDCT

## 4. THE VLSI IMPLEMENTATION

Because 2-D DCT/IDCT is a separable transform, it can be implemented by series of 1-D DCT/IDCTs with a transpose memory. Figure 5 shows the 2-D DCT/IDCT is implemented only one 1-D DCT/IDCT unit with a transpose memory.



Figure 5. The implementation of 2-D DCT/IDCT

Because the direction of signal flow of DCT and IDCT are different, each pipelined stage must include a lot of multiplexes. In order to solve the complicated routing, we make the DCT/IDCT be placed as sandwich form as shown in Figure 6. Therefore, no matter to process the DCT or the IDCT, the wires are routed only through the control unit.



Figure 6. The placement of DCT/IDCT

Figure 7 shows the data format of the DCT/IDCT chip. The DCT/IDCT chip requires seven kinds of the multiplier coefficients. We use Booth coding [10] to reduce the numbers of nonzero bits of the multiplier coefficients as shown in Table. 2.



Figure 7. The data format of DCT/IDCT

|   | coefficients   | decimal | binary                | binary (after Booth coding) | nonzero hits |
|---|----------------|---------|-----------------------|-----------------------------|--------------|
| a | 2cos(pi/16)    | 1.96157 | 01.1111 0110 0010 100 | 10,0000 1010 0010           | 4            |
| ь | 2cos(pi/8)     | 1.84776 | 01.1101 1001 0000 011 | 10,0010 1001 0000           | _4           |
| c | 2cos(3pi/16)   | 1.66294 | 01.1010 1001 1011 011 | 01,1010 1010 0100           | 6            |
| d | 1/[2cos(pi/4)] | 0.70711 | 00.1011 0101 0000 100 |                             | 5            |
| c | 2cos(5pi/16)   | 1.11114 | 01.0001 1100 0111 011 | 01.0010 0100 1001           | 5            |
| Ц | 2cos(3pi/8)    | 0.76537 | 00.1100 0011 1110 111 | 00.1100.0100.0001           | 4            |
| g | 2cos(7pi/16)   | 0.39018 | 00.0110 0011 1110 001 | 00.0110.0100.0010           | 4            |



A 6-bitx 6-bit multiplier implementation uses the Wallace tree

architecture [11] to simplify is shown in Figure 8. For example, Figure 9 shows a multiplier with the coefficient which is equal to  $2\cos(\pi/16)$ . It is important that to simplify the partial product and sign-bit extension of multiplier. Our simplified flow is as follow:

- In Figure 10, let "dddd" be the sign-bit extension, the "dddd" can be represent as "<u>d</u>" added with a "all-ones compensation vector". Using the method shows in Figure 10, the sign-bit extensions in Figure 9 are instead of "all-ones compensation vector", and can be collected beforehand.
- Use the combine skill of sign-bit extension shows in Figure 11, the partial product of multiplier in Figure 12 can reduce one row.
- In Figure 13, the Wallace Tree architecture is used to simplify the partial products to two rows. The two rows just require fast adder to generate the final product.



Figure 8. A 6-bitx 6-bit multiplier simplified with Wallace Tree architecture



Figure 12. The first time simplified of multiplier



#### 5. THE SIMULATION RESULT

The circuit modules of the chip are designed by verilog HDL. The verilog HDL programs are synthesized by the Synopsys tool with the 0.6  $\mu$  m Compass standard cell library. Figure 14 shows the layout which use the Cadence Silicon Ensemble tool to do the automatically place and route with 0.6  $\mu$  m 1P3M technology. The core area is about 3.916x 3.916mm<sup>2</sup>, and the chip can be operate up to 100MHz. The features of our DCT/IDCT chip are shown in Table 3, and the comparisons with other chips are shown in Table 4.



Figure 14. The Layout of the 2-D DCT/IDCT

| Input data format                          | 9-bit (DCT), 12-bit (IDCT)                      |  |
|--------------------------------------------|-------------------------------------------------|--|
| Output data format                         | 12-bit (DCT), 9-bit (IDCT)                      |  |
| Performance of gate level simulation       | 8 ns (125 MHz)                                  |  |
| Performance of transistor level simulation | 10 ns (100 MHz)                                 |  |
| Core area                                  | $3.9165 \times 3.915 \text{ mm}^2$              |  |
| Chip area                                  | $5.5008 \times 5.5008 \text{ mm}^2$             |  |
| Gate count                                 | 38,973.75                                       |  |
| Transistor count                           | 155,895                                         |  |
| Technology Cell Library                    | Compass 0.6 <sup>µ</sup> m Stadard Cell Library |  |
| Process Technology                         | TSMC 1P3M CMOS                                  |  |

Table 3. The feature of the 8x 8 2-D DCT/IDCT chip

|      | Block | Function | Technology ( <sup>µ</sup> m) | Transistors | Max. Speed (MHz) |
|------|-------|----------|------------------------------|-------------|------------------|
| [2]  | 8×8   | DCT/IDCT | 1.2                          | 320,000     | 50               |
| [3]  | 8×8   | DCT/IDCT | 0.8                          | 180,000     | 50               |
| _[5] | 8×8   | DCT      | 0.8                          | 147,839     | 50               |
| [6]  | 8×8   | IDCT_    | 0.6                          | 402,048     | 71               |
| ours | 8×8   | DCT/IDCT | 0.6                          | 155,895     | 100              |

Table 4. The comparison of DCT/IDCT chips

#### 6. CONCLUSION

In the paper, we propose an efficient architecture to implement a 2-D DCT/IDCT with a new algorithm. The proposed new algorithm makes all coefficients are positive to simplify the design of multipliers. The efficient architecture for the proposed algorithm requires only 9 multipliers and 21 adders/subtractors. The transistor count of the designed circuit is less than 160,000. In Table 4, the simulation result shows the performance of our chip is better than other chips, and is suitable for high-speed application such as HDTV.

## REFERENCE

- B. G. Lee, "A new algorithm to compute the discrete cosine transform," *IEEE Trans. Acoust., Speech, and Signal Processing*, vol. ASSP-32, pp. 1243-1245, Dec. 1984.
- [2] Vishnu. Srinivasan, and K. J. Ray Liu, "VLSI Design of High-Speed Time-Recursive 2-D DCT/IDCT Processor for Video Applications," *IEEE Transactions on Circuits and System for Video Technology*, vol. 6, no. 1, February 1996.
- [3] T. Miyazaki, T. Nishitani, M. Edahiro, M. Edahiro, I. Ono, and K. Mitsuhashi, "DCT/IDCT processor for HDTV developed with DSP silicon compiler," *J. VLSI Signal Processing*, no. 5, pp. 151-158, 1993.
- [4] C. C. Ju, "A High-Throughput DCT/IDCT Architecture and Design Methodology with Application to Real-Time Digital Video Codec System and Associated CAD Design," *Master Thesis, National Chiao Tung Univ., Taiwan, June 1997.*
- [5] J. S. Chiang and H. C. Huang, "Novel architecture for twodimensional high throughput rate real-time discrete cosine transform and the VLSI design," *INT. J. Electronics*, vol. 83, no. 4, pp. 519-527, 1997.
- [6] Y. P. Lee, T. H. Chen and L. G. Chen, "A Cost-Effective Architecture for 8x 8 Two-Dimensional DCT/ IDCT Using Direct Method," *IEEE transactions on circuits and systems* for video technology, vol. 7, No. 3, pp. 459-467, June 1997.
- [7] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," *Transactions on Computers*, vol. C-31, pp. 899-906, Sept. 1982.
- [8] K. T. Lo and W. K. Cham, "Analysis of Pruning in Fast Cosine Transform," IEEE Transactions on Signal Processing, vol. 44, no. 3. March 1996.
- [9] H. R. Wu and F. J. Paoloni, "A Two-Dimension Fast Cosine Transform Algorithm Based on Hou's Approach," IEEE Transactions on Signal Processing, vol. 39, no. 2, February 1991.
- [10] C. N. Lyu and D. W. Matula, "Redundant Binary Booth Recoding," Symposium on Computer Arithmetic, pp. 50-57, July 1995.
- [11] William J. Stenzel, William J. Kubitz and Gilles H. Garcia, "A Compact High-Speed Parallel Multiplication Scheme," *IEEE Transactions on Computers*, vol. C-26, No. 10, pp.948-957, 1977.